SUBMITTED TO: SPECIAL ISSUE OF THE JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
ON PARALLEL I/O SYSTEMS


Recovery Issues in Databases Using Redundant Disk Arrays

Antoine N. Mourad
W. Kent Fuchs
Daniel  G.  Saab 




Center  for  Reliable  and  High-Performance  Computing 
Coordinated Science Laboratory
1101 W. Springfield Ave.
University of Illinois
Urbana, Illinois 61801

June 22, 1992

Abstract 

Redundant disk arrays provide a way for achieving rapid recovery from media failures
with a relatively low storage cost for large scale database systems requiring high
availability. In this paper we propose a method for using redundant disk arrays to
support rapid recovery from system crashes and transaction aborts in addition to their
role in providing media failure recovery. A twin page scheme is used to store the parity
information in the array so that the time for transaction commit processing is not
degraded. Using an analytical model and simulation results, we show that the proposed
method achieves a significant increase in the throughput of database systems using
redundant disk arrays by reducing the number of recovery operations needed to maintain
the consistency of the database.
1 Introduction


In a database system, rapid recovery may be necessary for restoring the database to a consistent
state  after  a  failure.  Several  types  of  failures  can  occur.  The  most  typical  are  transaction  aborts 
which  can  be  due  to  program  errors,  deadlocks,  or  can  be  user  initiated.  When  a  transaction 
aborts,  the  recovery  manager  has  to  restore  all  database  pages  modified  by  the  transaction  to  their 
previous  state.  The  second  type  of  failure  is  a  system  crash.  In  this  case  system  tables  maintained 
in  main  memory  are  lost.  The  recovery  mechanism  has  to  UNDO  all  updates  made  to  the  database 
by  transactions  that  were  active  when  the  crash  occurred  and  to  REDO  modifications  performed 
by  complete  transactions  and  not  yet  reflected  in  the  database  at  the  time  of  the  crash. 

Another  type  of  failure  is  media  failure.  One  common  way  to  deal  with  this  type  of  failure  is 
by  periodically  generating  archive  copies  of  the  database  and  by  logging  updates  to  the  database 
performed  by  committed  transactions  between  archive  copies  into  a  redo  log  file.  When  a  media 
failure  occurs  the  database  is  reconstructed  from  the  last  copy  and  the  log  file  is  used  to  apply 
all  updates  performed  by  transactions  that  committed  after  the  last  copy  was  generated.  In  such 
a  case,  a  media  failure  causes  significant  down  time  and  the  overhead  for  recovery  is  quite  high. 
For  large  systems,  e.g.,  with  over  50  disks,  the  mean  time  to  failure  (MTTF)  of  the  permanent 
storage subsystem can be less than 25 days^1. Mirrored disks have been employed to provide rapid
media recovery [1]. However, disk mirroring incurs a 100% storage overhead which is prohibitive
for many applications. Redundant Disk Array (RDA) organizations [3, 10] provide an alternative
for  maintaining  reliable  storage.  However,  even  when  disk  mirroring  or  RDAs  are  used,  archiving 
and  redo  logging  may  still  be  necessary  to  protect  the  database  against  operator  errors  or  system 
software  design  errors. 

In this paper, we present a technique that exploits the redundancy in disk arrays to support
recovery from transaction and system failures in addition to providing fast media recovery. This
is achieved by using a twin page scheme for storing the parity information, making it possible to
keep the old version of the parity along with the new version. The old version of the parity is used
to undo updates performed by aborted transactions or by transactions interrupted by a system
failure.

^1 Assuming an MTTF of 30,000 hours for each disk.

In  Sections  2  and  3  we  briefly  review  several  techniques  for  transaction  recovery  in  database 
systems and discuss two RDA organizations. In Section 4, we present our database recovery scheme.
The results of our performance analysis are detailed in Section 5. Finally, in Section 6, we show
some simulation results using data from a real application.

2  Recovery  Techniques 

Recovery  algorithms  typically  use  some  form  of  logging  or  shadowing.  In  the  logging  approach 
[4],  before  a  new  version  (after-image)  of  a  record  or  page  is  written  to  the  database,  a  copy  of 
the  old  version  (before-image)  is  placed  into  a  sequential  log  file.  If  a  transaction  aborts  or  the 
system crashes, the log file is analyzed and the state of the database is restored. In the shadowing
approach  the  update  of  a  page  is  placed  into  a  new  physical  page  on  disk  [6,  9].  The  physical  pages 
containing  the  old  versions  are  released  after  all  updates  of  the  committing  transaction  have  been 
written  to  disk.  One  problem  with  the  shadowing  approach  is  dynamic  mapping  since  it  requires 
maintaining  a  very  large  page  table  which  leads  to  high  I/O  overhead  during  normal  processing. 
Another problem is the disk scrambling effect which decreases the sequentiality of disk accesses.

In describing and in analyzing our method, we will use the following taxonomy of database
recovery  algorithms  introduced  by  Haerder  and  Reuter  [5].  They  classify  recovery  algorithms  with 
respect  to  the  following  four  concepts: 

Propagation^2 of updates. The propagation strategy can be ATOMIC, in which case any set
of updated pages can be propagated to the database in one atomic action. In the ¬ATOMIC case,
propagation of updates can be interrupted by a system crash and database pages are updated in place.

^2 Propagation to the database means that the new version is visible to higher level software. Updates can be
written to disk without being propagated (e.g., shadowing).

Page replacement. Two policies can be used: the STEAL policy allows pages modified by
uncommitted transactions to be propagated to the database before end-of-transaction (EOT); the
opposite policy is referred to as ¬STEAL. No UNDO recovery is necessary with a ¬STEAL policy.

EOT processing. Two categories exist: the FORCE discipline requires all pages modified by
a transaction to be propagated before EOT; the opposite discipline is called ¬FORCE.

Checkpointing  Schemes.  Checkpointing  is  used  to  propagate  updates  to  the  database  in 
order  to  minimize  the  number  of  REDO  recovery  actions  to  be  performed  after  a  crash.  In  the 
Transaction  Oriented  Checkpointing  (TOC)  scheme,  a  checkpoint  is  generated  at  the  end  of  each 
transaction. This is equivalent to using the FORCE discipline in EOT processing. Two other types
of checkpoints can be used: Transaction Consistent Checkpoints (TCC) are generated during quiescent
periods where no transactions are being processed; Action Consistent Checkpoints (ACC) are
less restrictive and require that no update statements are processed during checkpoint generation.

3  Redundant  Disk  Arrays 

3.1  Data  Striping 

Striped disk arrays have been proposed and implemented for increasing the transfer bandwidth in
high performance I/O subsystems [7, 8, 13]. In order to allow the use of a large number of disks in
such arrays without compromising the reliability of the I/O subsystem, redundancy is sometimes
included in the form of parity information. Patterson et al. [10] have presented several possible
organizations for Redundant Arrays of Inexpensive Disks (RAID). One interesting organization is
RAID with rotated parity in which blocks of data are interleaved across N disks while the parity
of the N blocks is written on the (N+1)st disk. The parity is rotated over the set of disks in order
to avoid contention on the parity disk. Figure 1 shows the array organization with four disks. The
organization allows both large (full stripe) concurrent accesses and small (individual disk) accesses.
Figure 1: RAID with rotated parity on four disks.




Figure  2:  Parity  striping  of  disk  arrays. 

In  this  paper,  we  concentrate  on  small  read/write  accesses.  For  a  small  write  access,  the  data  block 
is  read  from  the  relevant  disk  and  modified.  To  compute  the  new  parity,  the  old  parity  has  to  be 
read,  XORed  with  the  new  data  and  XORed  with  the  old  data.  Then  the  new  data  and  new  parity 
can be written back to the corresponding disks. Stonebraker et al. [14] have advocated the use of
a RAID organization to provide high availability in database systems.
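
The small-write parity update described above can be sketched as follows. This is a minimal illustration, assuming pages are modeled as equal-length byte strings; the helper names `xor_pages` and `small_write_parity` and the toy page contents are hypothetical and do not come from the paper.

```python
def xor_pages(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length pages."""
    if len(a) != len(b):
        raise ValueError("pages must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def small_write_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """New parity = old parity XOR old data XOR new data."""
    return xor_pages(xor_pages(old_parity, old_data), new_data)


# Toy stripe with N = 3 data pages (illustrative values only).
d = [bytes([1, 2]), bytes([4, 8]), bytes([16, 32])]
parity = xor_pages(xor_pages(d[0], d[1]), d[2])

# Small write to the second data page: only that page and the parity are touched.
new_d1 = bytes([7, 7])
parity = small_write_parity(parity, d[1], new_d1)
d[1] = new_d1

# Full recomputation over the stripe, used here only as a cross-check.
full = xor_pages(xor_pages(d[0], d[1]), d[2])
```

The incremental update needs the old data and old parity but none of the other data pages in the stripe, which is why a small write costs a fixed number of page transfers regardless of N.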

3.2  Parity  Striping 

Gray  et  al.  [3]  studied  ways  of  using  an  architecture  such  as  RAID  in  on-line  transaction  processing 
(OLTP) systems. They found that because of the nature of I/O requests in OLTP systems, namely a
large number of small accesses, it is not convenient to have several disks servicing the same request.
Hence, the organization shown in Figure 2 was proposed. It is referred to as parity striping. It
consists of reserving an area for parity on each disk and writing data sequentially on each disk
without interleaving. For a group of N + 1 disks, each disk is divided into N + 1 areas; one of these
areas on each disk is reserved for parity and the other areas contain data. N data areas from N
different disks are grouped together in a parity group and their parity is written on the parity area
of the (N+1)st disk.


4  RDA-Based  Recovery 


In the remainder of this paper, we consider an I/O subsystem that is a collection of redundant disk
arrays, organized using either parity striping or data striping (RAID with
rotated parity). In the case of data striping we assume that a large striping unit is used in order to
ensure that I/O requests will typically be serviced by a single data disk. We also make the following
assumptions: communication between main memory and the I/O subsystem is performed using
fixed size pages; database pages are updated in place, which implies that propagation is ¬ATOMIC;
a STEAL policy is used, thus allowing modified pages to be propagated before EOT.

4.1  General  Description  of  the  Approach 

RDA-based  recovery  makes  use  of  the  parity  information  present  in  the  disk  arrays  to  undo  updates 
performed  by  aborted  transactions.  However,  the  parity  is  not  sufficient  by  itself  to  undo  all  updates 
performed  by  an  aborted  transaction.  Updates  that  cannot  be  undone  using  the  parity  are  dealt 
with  using  one  of  the  traditional  recovery  schemes. 

A  page  parity  group  is  the  set  of  pages  that  share  the  same  parity  page.  In  the  following,  unless 
there is ambiguity, we will use the term parity group to denote a page parity group. A parity group
can be in one of two states: clean or dirty. A parity group is dirty when one of its data pages has
been modified by a transaction and the modified version has been written back to the database
before the transaction modifying it commits (using the notation of Haerder and Reuter, the page
has been stolen from the buffer). Otherwise the parity group is called clean. Only one modified data
page per parity group can be written back to the database by uncommitted transactions without
UNDO logging. If additional pages in the parity group have been modified and need to be written
back to the database then their before-images must be logged first. A dirty parity group goes back
to the clean state when the transaction that caused it to become dirty commits. Figure 3 shows
the state transition diagram of a parity group. A table in main memory contains the numbers of all
parity groups that are in the dirty state. It also contains the number of the data page within the
group that caused the group to be in the dirty state and the number of the parity page holding the
updated parity. Only log N bits need to be used to store the data page number and one bit for the
parity page number. The table is used to check whether a page updated by an active transaction
can be written back to disk without UNDO logging.

Figure 3: State transition diagram of a page parity group. Transitions: clean to dirty when
transaction T modifies page $D_i$ and $D_i$ is written back to the database before EOT; dirty to dirty
when T rereferences $D_i$, modifies it and $D_i$ is written back to the database before EOT; dirty to
clean when transaction T commits.

When  a  transaction  updates  a  page,  that  page  can  be  written  back  to  the  database  without 
UNDO  logging  if  its  parity  group  is  clean  or  if  its  parity  group  is  dirty  and  the  update  is  for  the 
same page that caused the group to move into the dirty state, i.e., the same page has been updated,
stolen from the buffer then rereferenced by the same transaction, updated and stolen again from
the buffer before EOT^3. Note that this does not affect the degree of concurrency or interfere with
the  locking  policy  used  in  the  system.  We  do  not  specify  when  a  transaction  can  or  cannot  modify 
a  page.  We  only  specify  when  a  modified  page  can  be  written  back  to  disk  without  UNDO  logging. 
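
The write-back rule above can be sketched with a small in-memory table. This is only an illustration under stated assumptions: the class `DirtyGroupTable`, its method names, and the decision to key each entry by a transaction id are hypothetical (the paper's table stores the data page number and the parity page number; the transaction id is added here so that a commit can clean the right groups).

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class DirtyGroupTable:
    """Hypothetical in-memory table of dirty parity groups."""
    # parity group number -> (transaction id, data page number within the group)
    dirty: Dict[int, Tuple[int, int]] = field(default_factory=dict)

    def can_write_without_undo_log(self, txn: int, group: int, page: int) -> bool:
        entry = self.dirty.get(group)
        if entry is None:
            return True                 # clean group
        return entry == (txn, page)     # same page, stolen again by the same transaction

    def write_back(self, txn: int, group: int, page: int) -> bool:
        """Record a write-back before EOT. Returns True if UNDO logging is required."""
        if self.can_write_without_undo_log(txn, group, page):
            self.dirty[group] = (txn, page)
            return False
        return True

    def commit(self, txn: int) -> None:
        """Groups dirtied by txn go back to the clean state."""
        self.dirty = {g: e for g, e in self.dirty.items() if e[0] != txn}


table = DirtyGroupTable()
r1 = table.write_back(txn=1, group=5, page=2)   # clean group
r2 = table.write_back(txn=1, group=5, page=2)   # same page rereferenced and stolen again
r3 = table.write_back(txn=2, group=5, page=0)   # second page in a dirty group
table.commit(1)
r4 = table.write_back(txn=2, group=5, page=0)   # group is clean again
```

The table only decides whether a before-image must be logged; it places no restriction on which transactions may modify which pages.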

^3 Normally such an event should not occur often since buffer management algorithms are not supposed to replace
a page that will be referenced again in the near future.

Figure 5: Parity striping organization with the twin page scheme for the parity.

If a single parity page is used, then when a group becomes dirty the old parity information has
to be kept in the parity page to be able to recover in case of a transaction failure. That would
mean that when the transaction commits, the new parity has to be recomputed in order to update
the parity page. That would require reading all the data pages in the group in order to compute
the new parity. To avoid that problem a twin page scheme is used for the parity pages. The basic
mechanism of the twin page scheme is as follows: one of the parity pages always contains the valid
parity of the group while the other page contains obsolete parity information. When a data page is
modified in a parity group, the obsolete parity page (P for example) is updated with the new parity
of the array. If the transaction performing the update commits then the modified parity page (P)
becomes  the  valid  parity  page  otherwise  the  other  parity  page  (P')  remains  the  valid  parity  page 
and  its  contents  are  used  to  recover  the  data  page  that  was  modified  by  the  failed  transaction. 
Figures  4  and  5  show  the  data  striping  organization  and  the  parity  striping  organization  when  the 
twin page scheme is used for the parity. Twin parity pages are denoted $P_i$ and $P_i'$ in the data
striping case and $P_x$ and $P_y'$, with $y = (x + 1) \bmod (N + 2)$, in the parity striping case. Figure 6
shows  the  contents  of  a  parity  group  including  the  twin  parity  pages.  In  order  to  recover  the  old 
version  of  a  data  page  after  a  transaction  abort  it  is  sufficient  to  XOR  the  contents  of  both  parity 
pages and the new data page: $D_{old} = (P \oplus P') \oplus D_{new}$. When a parity group is dirty because
one of its data pages $D_i$ has been stolen from the buffer and another page $D_j$ needs to be written
to disk, UNDO logging must be performed for $D_j$^4; then both parity pages P and P' need to be
updated, since when the group is dirty it is necessary to maintain a current parity page reflecting
the actual parity of the data on disk and an "old" parity page that would be used to recover the
uncommitted data page $D_i$ in case of a transaction abort. In all cases, when writing a data page
to disk the corresponding parity page(s) must be updated first.

^4 The before-image of the page in the case of page logging or of the modified record in the case of record logging.

Figure 6: The contents of a page parity group.
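
The undo step based on the twin parity pages can be illustrated with a short sketch. It assumes pages are equal-length byte strings; the helper `xor_pages`, the variable names, and the toy page contents are hypothetical and serve only to show that XORing both parity pages with the new data page yields the old data page.

```python
from functools import reduce


def xor_pages(*pages: bytes) -> bytes:
    """Bytewise XOR of any number of equal-length pages."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*pages))


# Parity group with three data pages (illustrative contents).
data = [bytes([3, 5]), bytes([9, 1]), bytes([6, 6])]
valid_parity = xor_pages(*data)            # P': last committed parity

# An uncommitted transaction overwrites data[1]; the new parity goes to the
# obsolete twin P, while P' is left untouched.
old_page = data[1]
data[1] = bytes([0xAA, 0x55])
working_parity = xor_pages(valid_parity, old_page, data[1])   # P

# Abort: recover the old data page as (P xor P') xor D_new.
recovered = xor_pages(working_parity, valid_parity, data[1])
```

Because P differs from P' only by the XOR of the old and new versions of the stolen page, no before-image of that page has to be read from a log.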

4.2  Twin  Page  Management 

The  twin  parity  pages  are  stored  on  different  disks.  This  is  necessary  in  order  to  be  able  to  perform 
transaction  recovery  following  a  disk  failure.  In  order  to  identify  which  of  the  twin  parity  pages 
contains  the  valid  parity  information,  a  timestamp  is  stored  in  the  page  header.  The  page  with 
the  highest  timestamp  contains  the  valid  parity  information.  When  an  update  is  undone  after  a 
transaction or system failure, the timestamp of the current parity page is reset to 0. Algorithm
Current_Parity shown in Figure 7 selects the current parity page. When a data page is updated
both parity pages are read and one of them is selected for modification. Then the parity is computed
and the modified parity page is written back to disk. In order to avoid reading both parity pages,
a bit map can be maintained in main memory indicating which is the current parity page for each
of the parity groups in the database. However such a bit map may not survive a system crash.
Hence following a crash that destroys the map, algorithm Current_Parity will have to be used to
identify the current parity page and to reconstruct the bit map. In this case, two bits would have
to be used in the bit map for each parity group to code the three possible states: parity page P is
the current parity page, parity page P' is the current parity page or the information is not available
and algorithm Current_Parity has to be used. Following a system crash a background process
that runs during idle periods of the system can be initiated to reconstruct the bit map.
Current_Parity(pg)
begin
    Read twin parity pages in parity group pg;
    if Timestamp(P) > Timestamp(P') then
        Current_Parity ← P;
    else
        Current_Parity ← P';
end


Figure 7: Algorithm Current_Parity determines the current parity page.
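
A runnable rendering of the timestamp comparison in Figure 7, together with the reset-on-undo rule from the text, might look as follows. This is a sketch under assumptions: the `ParityPage` class, its fields, and the function names are hypothetical, and timestamps are modeled as plain integers.

```python
from dataclasses import dataclass


@dataclass
class ParityPage:
    timestamp: int      # stored in the page header
    payload: bytes      # parity contents


def current_parity(p: ParityPage, p_twin: ParityPage) -> ParityPage:
    """Select the twin with the higher timestamp; ties go to the second page,
    mirroring the else branch of Figure 7."""
    return p if p.timestamp > p_twin.timestamp else p_twin


def undo_update(page: ParityPage) -> None:
    """After an undo, reset the timestamp so the other twin becomes current."""
    page.timestamp = 0


p = ParityPage(timestamp=12, payload=b"new parity")
p_twin = ParityPage(timestamp=9, payload=b"old parity")

before = current_parity(p, p_twin)   # p holds the most recent update
undo_update(p)                       # the update written to p is undone
after = current_parity(p, p_twin)    # p_twin is now the current parity page
```

An in-memory bit map, as discussed above, would simply cache the result of `current_parity` per parity group so that both twins need not be read on every update.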


C:  committed;  O:  obsolete;  I:  invalid;  W:  working 

Figure 8: State transition diagram of the twin parity pages.

Each  of  the  twin  parity  pages  can  be  in  one  of  four  states:  committed,  obsolete,  working  or 
invalid  [16].  A  parity  page  is  committed  when  it  contains  the  last  committed  parity  update.  It  is 
obsolete  when  it  contains  old  committed  parity  information.  It  is  in  the  working  state  when  it  has 
been  updated  by  an  active  transaction,  and  it  is  in  the  invalid  state  if  the  last  transaction  updating 
it  has  aborted.  Figure  8  shows  the  state  transition  diagram  of  the  twin  parity  pages. 


4.3  Recovery  from  System  Failure 


Following a system crash we need to identify which transactions have to be backed out and which
pages have been modified on disk by those transactions. A Begin-Of-Transaction (BOT) record
needs to be written to a log file after the transaction begins and before it writes back any modified
pages to disk and an EOT record must be written to the log file when the transaction commits.
Modified database pages for which UNDO logging has been performed can be recovered by reading
their before-images from the log. Modified database pages for which UNDO logging has not been
performed can be recovered using the parity pages. However information on which pages have been
written to the database without UNDO logging has to be saved in permanent storage. To solve
this problem, a technique similar to the one used in TWIST [11] can be employed. In TWIST,
a twin page scheme is used to store all database pages, no before-image logging is performed and
the same problem of identifying which pages to undo after a crash is encountered. The solution
makes use of a log chain which consists of pointers stored in the page headers that link together
pages modified by the same active transaction. In our case, only modified pages written back to the
database before EOT without UNDO logging will be part of the log chain. The head of the chain
though has to be logged along with the transaction id. I/O operations to maintain the log chain
can be hidden behind regular I/O requests and do not significantly affect system performance.

5  Performance  Analysis 

In order to evaluate the benefit of RDA recovery, we develop an analytical model to evaluate
transaction throughput for different algorithms. Since the cost of maintaining parity information
in a system with redundant disk arrays is relatively high, we do not advocate the use of RDAs
solely for the purpose of supporting transaction and crash recovery. We look at the benefit of
using RDA recovery in a system that already needs RDAs for the purpose of rapid media recovery.
We do this by comparing the throughput of systems using traditional recovery algorithms and
redundant disk arrays to systems with the same recovery algorithms in combination with RDA
recovery. We consider both page and record logging and in each case we examine two different
recovery algorithms and evaluate the improvement achieved by adding RDA recovery to them. As
far as storage is concerned, the extra cost involved in using RDA recovery is that of the twin page
scheme for the parity, which is (100/N)% of the initial data storage cost.

RDA recovery reduces the amount of UNDO logging and hence is appropriate for systems using
update-in-place, which implies ¬ATOMIC propagation and a STEAL policy for page replacement.
We therefore restrict ourselves to the analysis of such algorithms. Within this class of algorithms
we examine both the FORCE and ¬FORCE strategies for EOT processing. For algorithms of the
type ¬ATOMIC, STEAL, FORCE, only a TOC checkpointing policy makes sense. For algorithms
of the type ¬ATOMIC, STEAL, ¬FORCE, either ACC or TCC checkpoints could be used; however,
algorithms using ACC checkpointing were shown to outperform those using the TCC type^5 [12].
Hence we only look at the former type of checkpointing.

We use the same basic model as the one introduced by Reuter in his evaluation of the performance
of several database recovery techniques [12]. We assume that the system is I/O bound and
therefore we look only at the number of I/O requests required to perform a given operation. We
also assume that the system is running continuously with no periodic shutdown. This implies that
all cleanup activities required by the algorithm are accounted for in the cost calculations instead
of assuming they are performed by some background process or during shutdown periods.

The workload considered consists of a set of P transactions executing concurrently in the system.
Transactions are of two types: update or retrieval. The fraction of update transactions is $f_u$. Each
transaction accesses s database pages. The fraction of accessed pages that are modified by an update
transaction is $p_u$. To characterize the behavior of the database buffer, we use the communality C,
which denotes the probability that a page requested by an incoming transaction is present in the
buffer. The number of page frames in the buffer is denoted by B. It is assumed that the buffer
is sufficiently large so that once a transaction has referenced a page, the page will remain in the
buffer until it is no longer needed by the transaction^6.

^5 Also, TCC checkpointing contradicts our assumption of a continuously running system since it requires the
establishment of a quiescent point where no update transactions are present in the system.

The cost of recovery after a system crash is denoted by $c_s$ and is measured by the number of
page transfers between main memory and the disk subsystem required to perform recovery. The
cost of executing a transaction is denoted by $c_t$. The transaction throughput is defined as the
number of transactions processed during an availability interval. An availability interval T is the
period between two system crashes. Since all cost measures are evaluated in terms of number of
I/O operations, we assume that the availability interval is measured in units of page transfers^7.

If checkpointing is used then the length of a checkpointing interval is denoted by I and is also
measured in units of page transfers. The cost of generating a checkpoint is denoted by $c_c$. Assuming
that the crash occurs in the middle of a checkpointing interval, the number of page transfers available
for processing transactions in an availability interval is $T - c_s - c_c\,(T - c_s - I/2)/I$. Hence the
throughput is given by:

$$r_t = \frac{(T - c_s)(1 - c_c/I) + c_c/2}{c_t}$$

We assume that $c_t$ is independent of I. Hence the optimal checkpointing interval can be easily
derived from the following equation [12]:

$$\frac{1}{c_t}\left(-\frac{\partial c_s}{\partial I}\left(1 - \frac{c_c}{I}\right) + (T - c_s)\,\frac{c_c}{I^2}\right) = 0. \qquad (1)$$
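
As a quick numerical check of the throughput expression, the sketch below computes the page transfers available for transaction processing and the resulting throughput. The function names and all parameter values are hypothetical, chosen only for illustration; in particular, $c_s$ is treated here as a given number rather than as a function of I.

```python
import math


def available_transfers(T: float, c_s: float, c_c: float, I: float) -> float:
    """Page transfers left for transactions: T - c_s - c_c * (number of checkpoints)."""
    checkpoints = (T - c_s - I / 2) / I   # crash assumed in the middle of an interval
    return T - c_s - c_c * checkpoints


def throughput(T: float, c_s: float, c_c: float, I: float, c_t: float) -> float:
    """Transactions per availability interval."""
    return ((T - c_s) * (1 - c_c / I) + c_c / 2) / c_t


# Hypothetical values, all measured in page transfers.
T, c_s, c_c, I, c_t = 1_000_000.0, 5_000.0, 200.0, 10_000.0, 40.0

r_t = throughput(T, c_s, c_c, I, c_t)
same = math.isclose(r_t, available_transfers(T, c_s, c_c, I) / c_t)
```

The two formulations agree because $c_c (T - c_s - I/2)/I$ expands to $c_c (T - c_s)/I - c_c/2$.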

Let $c_r$ denote the cost of executing a retrieval transaction and $c_u$ that of an update transaction.
Then $c_t$ can be expressed as follows:

$$c_t = (1 - f_u)\,c_r + f_u\,c_u$$

^6 The page could still be replaced before the transaction commits if a STEAL policy is used; however, if it is replaced
it will not be rereferenced by the transaction.

^7 Mathematically, T can be defined as follows: T = (length of availability interval in seconds) / (time to transfer a
page to/from disk in seconds).

$c_r$ itself includes two components: the cost of reading pages that are not found in the database
buffer and the cost of writing back the replaced pages if they have been modified. Hence:

$$c_r = s(1 - C) + a\,s\,(1 - C)\,p_m, \qquad (2)$$

where $p_m$ denotes the probability that the replaced page was modified and a denotes the number
of page transfers necessary to perform one write to the disk array; a is equal to 3 or 4 depending
on whether or not the old data page is in the buffer at the time of writing the new data. For $c_u$,
we have two additional components which represent the cost of logging the transaction ($c_l$) and the
cost of backing out the transaction ($c_b$) in the case where an abort occurs. Hence:

$$c_u = s(1 - C) + a\,s\,(1 - C)\,p_m + c_l + p_b\,c_b, \qquad (3)$$

where $p_b$ denotes the probability of an abort.
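
The cost expressions for retrieval and update transactions, and their mix into the average transaction cost, can be evaluated with a few lines of code. This is a sketch only: the function names are hypothetical, the symbols follow the definitions in the text, and the numeric values are made up for illustration.

```python
import math


def retrieval_cost(s: float, C: float, a: float, p_m: float) -> float:
    """c_r: buffer misses plus write-back of replaced pages that were modified."""
    return s * (1 - C) + a * s * (1 - C) * p_m


def update_cost(s: float, C: float, a: float, p_m: float,
                c_l: float, p_b: float, c_b: float) -> float:
    """c_u: same I/O as a retrieval, plus logging and expected backout cost."""
    return retrieval_cost(s, C, a, p_m) + c_l + p_b * c_b


def transaction_cost(f_u: float, c_r: float, c_u: float) -> float:
    """c_t: average cost over the retrieval/update mix."""
    return (1 - f_u) * c_r + f_u * c_u


# Hypothetical workload parameters (not taken from the paper).
s, C, a, p_m = 10, 0.5, 4, 0.25
c_l, p_b, c_b = 20.0, 0.01, 50.0
f_u = 0.5

c_r = retrieval_cost(s, C, a, p_m)
c_u = update_cost(s, C, a, p_m, c_l, p_b, c_b)
c_t = transaction_cost(f_u, c_r, c_u)
```

With these illustrative numbers half of the page references miss the buffer, so both transaction types pay 5 reads plus 5 page transfers of write-back, and the update transaction adds its logging and expected backout cost on top.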

5.1  Evaluation  of  the  Probability  of  Logging 

We consider a set of K pages that have been modified by active transactions and we compute the
expected value of the size of the subset of pages that can be written back to the database without
UNDO logging. N is the number of data pages in a parity group and S is the total number of data
pages in the database. We assume that the K pages are randomly chosen from the S pages in the
database. Note that by using data striping (RAID) with a large striping unit or parity striping, any
sequentiality in database accesses will act in favor of our scheme by distributing the pages accessed
over distinct parity groups.

The parity groups in the database are numbered from 1 to S/N. Let $X_i$, $1 \le i \le S/N$, be
the random variable whose value is 1 if one of the K pages is a member of parity group i, and 0
otherwise. Let X be the random variable denoting the number of parity groups that contain at least
one of the K pages. X is also the number of pages that can be directly written back to the database
since one page per parity group can be written back. We have:

$$X = \sum_{i=1}^{S/N} X_i.$$
Since the K pages are assumed to be randomly chosen, each parity group has the same probability
of being accessed by those K page references. Hence the $X_i$'s are identically distributed. Therefore,
the expected value of X is $E[X] = \sum_{i=1}^{S/N} E[X_i] = \frac{S}{N} E[X_1]$. Since $X_1$ is a Bernoulli random variable,
$E[X_1] = \Pr(X_1 = 1)$ and $E[X] = \frac{S}{N}\,(1 - \Pr(X_1 = 0))$, which can be written: $E[X] = \frac{S}{N}\left(1 - \binom{S-N}{K} \Big/ \binom{S}{K}\right)$.

Hence if $K$ modified, "uncommitted" pages are to be written to the database, the probability of
having to log one of those pages is given by:

$$p_l = 1 - \frac{E[X]}{K}. \qquad (4)$$
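As a rough illustration, the following Python sketch evaluates the probability of logging, taken here as $p_l = 1 - E[X]/K$, for given $K$, $N$ and $S$. It assumes that the $K$ pages are distinct and drawn uniformly without replacement, so that the probability that a parity group is untouched is a ratio of binomial coefficients. The function names are illustrative only.

```python
from math import comb


def expected_groups(K: int, N: int, S: int) -> float:
    """Expected number of distinct parity groups touched by K random pages.

    Assumes K distinct pages drawn uniformly from S pages, with S/N parity
    groups of N data pages each (an assumption of this sketch).
    """
    groups = S // N
    # Probability that a given parity group contains none of the K pages.
    p_untouched = comb(S - N, K) / comb(S, K)
    return groups * (1.0 - p_untouched)


def logging_probability(K: int, N: int, S: int) -> float:
    """Probability that a page written back must be UNDO logged."""
    if K == 0:
        return 0.0
    return 1.0 - expected_groups(K, N, S) / K


if __name__ == "__main__":
    # Example values: N = 10 and S = 5000; K is chosen arbitrarily.
    for K in (1, 5, 22, 100):
        print(K, logging_probability(K, 10, 5000))
```

With a single page ($K = 1$) no logging is needed, and the probability grows as more uncommitted pages compete for the same parity groups.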


5.2  Page  Logging 

5.2.1 Algorithm of the Type ¬ATOMIC, STEAL, FORCE, TOC

With the FORCE discipline, the checkpoint is taken at the end of each transaction. The cost of
checkpointing is therefore accounted for in the cost of logging. In the model, we set $c_c = 0$. Given
our assumption that pages are not rereferenced by the calling transaction after they have been
replaced in the buffer, the cost of writing and logging a page will be the same whether the page is
stolen from the buffer before transaction commit or whether it stays in the buffer until EOT and is
then logged and written to the database. Hence we will account for all the costs involved in logging
the pages and writing them back to the database as part of the cost of logging. This allows us to
set the page write-back term to zero in the expressions for $c_r$ and $c_u$. The expression for $c_l$ is:

$$c_l = 3 \times s p_u + 4 \times (2 s p_u) + 4 \times 4$$

The first term is the cost of writing the pages back to the database. Each write to the disk array
costs three I/O operations since, with the FORCE discipline, the old data is kept in the buffer
until EOT for the purpose of UNDO logging. The second term is the cost of writing to the UNDO
and REDO log files. REDO information is needed only in the case where an operator error or
a system software error damages more than one disk in the disk array. The log files are stored
separately, which makes reading the log to back out aborted transactions less costly. The last term
in the expression of $c_l$ is the cost of writing BOT and EOT records to each of the log files.

The probability of having to log a page with RDA recovery is dependent on the number $K$ of
pages written back to the database by incomplete transactions. We assume that when a transaction
writes back a page to the database before committing, the other concurrent transactions are halfway
through writing their own modified pages. Therefore $K$ is equal to half the total number of pages
modified by concurrent update transactions. Hence the probability of logging is given by Equation 4
in which $K$ is replaced with⁶ $P s f_u p_u / 2$. With RDA recovery, the formula for the cost of logging
becomes:

$$c_l' = (3 + 2p_l) s p_u + 4(s p_u + s p_u p_l + 4) + 4(p_l - p_l^{s p_u})$$

The major difference with $c_l$ is that UNDO logging has to be performed only when the parity
group is dirty, i.e., with probability $p_l$. The term $2p_l$ is added to 3 to account for the fact that
when writing to a dirty parity group both parity pages need to be updated⁷. The last term in the
expression of $c_l'$ denotes the cost of writing the log chain header to the log. The header is normally
written along with the BOT record in the same page except when the first page written by the
transaction to the database has to be logged and not all pages updated by the transaction have to
be logged.
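To make the comparison concrete, here is a minimal Python sketch of the two logging cost expressions for page logging with FORCE, TOC: $c_l = 3 s p_u + 4(2 s p_u) + 16$ without RDA recovery, and $c_l' = (3 + 2p_l) s p_u + 4(s p_u + s p_u p_l + 4) + 4(p_l - p_l^{s p_u})$ with it. The function names are hypothetical and $p_l$ is passed in directly rather than derived.

```python
def logging_cost(s: float, p_u: float) -> float:
    """c_l: page write-backs (3 I/Os each), UNDO and REDO log writes,
    and BOT/EOT records on both log files."""
    n = s * p_u  # pages modified by the transaction
    return 3 * n + 4 * (2 * n) + 4 * 4


def logging_cost_rda(s: float, p_u: float, p_l: float) -> float:
    """c_l': UNDO logging only with probability p_l, two parity pages
    updated when the group is dirty, plus the log chain header term."""
    n = s * p_u
    return (3 + 2 * p_l) * n + 4 * (n + n * p_l + 4) + 4 * (p_l - p_l ** n)


if __name__ == "__main__":
    s, p_u = 10, 0.9  # high update frequency environment
    for p_l in (0.0, 0.05, 0.5, 1.0):
        print(p_l, logging_cost(s, p_u), logging_cost_rda(s, p_u, p_l))
```

When $p_l$ is small, the RDA cost is dominated by the REDO log and the page writes; when $p_l = 1$ it exceeds $c_l$ by the extra parity page updates.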

To evaluate $c_b$, we assume that a transaction aborts in the middle of processing its pages and
that the other concurrent update transactions have also logged half their modified pages. The
UNDO log has to be read up to the BOT record of the aborting transaction.

$$c_b = (p_u s/2)(P f_u) + P f_u + 4(p_u s/2) + 4$$

The first term is the number of before-images that have to be read from the log. The second term
is the number of BOT/EOT records to be read. The third term is the number of page transfers
to and from the database to undo the modifications performed by the aborting transaction and
the last term accounts for the writing of a rollback record. With RDA recovery the above formula
becomes:

$$c_b' = (p_u p_l s/2) P f_u + (p_l - p_l^{s p_u}) P f_u + P f_u + (p_u s/2)(6p_l + 5(1 - p_l)) + 4$$

In the first term the number of logged before-images to be read is now multiplied by $p_l$. The
second term is the expected number of log chain headers to be read from the log. The other major
difference is in the fourth term. It is due to the fact that, when recovering a page that has been
logged, up to six I/O operations might be necessary since its parity group may still be dirty⁸. On
the other hand, if the page has been written to the database without being logged, it is necessary
to read both parity pages in its parity group and the "new" data page and then overwrite the
database page with the old data and modify the state of the parity page from working to invalid
by resetting the timestamp in its header. Hence five I/O operations will be necessary in the latter
case.

⁶ Page logging implies the use of page locking and hence the sets of pages modified by concurrent update transactions
are disjoint.

⁷ We assume that log file pages and data pages are not mixed in the same parity groups.
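The backout costs can be sketched in Python as follows, under the reading that $c_b = (p_u s/2)(P f_u) + P f_u + 4(p_u s/2) + 4$ and that, with RDA recovery, a logged page costs up to six I/O operations to undo while an unlogged page costs five. The function names and the placement of the log chain header term are assumptions of this sketch.

```python
def backout_cost(P: float, f_u: float, s: float, p_u: float) -> float:
    """c_b: read before-images and BOT/EOT records, undo half of the
    transaction's modified pages (4 I/Os each), write a rollback record."""
    half = p_u * s / 2      # pages modified so far by each transaction
    updaters = P * f_u      # concurrent update transactions
    return half * updaters + updaters + 4 * half + 4


def backout_cost_rda(P: float, f_u: float, s: float, p_u: float,
                     p_l: float) -> float:
    """c_b': only logged before-images are read; undo costs 6 I/Os for a
    logged page and 5 I/Os for a page written back without logging."""
    half = p_u * s / 2
    updaters = P * f_u
    headers = (p_l - p_l ** (s * p_u)) * updaters  # log chain headers read
    undo = half * (6 * p_l + 5 * (1 - p_l))
    return half * p_l * updaters + headers + updaters + undo + 4


if __name__ == "__main__":
    print(backout_cost(6, 0.8, 10, 0.9))
    print(backout_cost_rda(6, 0.8, 10, 0.9, 0.05))
```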

After a system crash, only UNDO recovery needs to be performed. Hence the formula for $c_s$
contains the cost of reading the UNDO log file up to the BOT record of the oldest transaction alive
at the time of the crash and then overwriting the modifications. The work of the oldest transaction
alive overlapped with the work of some committed transactions; therefore the log records for half
the work of about $2Pf_u$ transactions need to be read. Hence the expressions for $c_s$ and $c_s'$ are:

$$c_s = P f_u (s p_u + 2) + 4(P f_u p_u s/2)$$

$$c_s' = P f_u (p_u s p_l + 2(p_l - p_l^{s p_u}) + 2) + P f_u (p_u s/2)(4p_l + 5(1 - p_l)) + S/N$$

The term $S/N$ is an upper bound for the cost of reconstructing the bit map for the current parity
page.
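A small Python sketch of the crash recovery costs for this case is given below, assuming $c_s = P f_u (s p_u + 2) + 4(P f_u p_u s/2)$ and the RDA variant with its additive $S/N$ upper bound for rebuilding the parity bit map. The function names are illustrative, and $p_l$ is supplied as an input.

```python
def crash_cost(P: float, f_u: float, s: float, p_u: float) -> float:
    """c_s: read the UNDO log for about 2*P*f_u half-finished transactions,
    then overwrite the pages of the P*f_u incomplete ones (4 I/Os each)."""
    updaters = P * f_u
    return updaters * (s * p_u + 2) + 4 * (updaters * p_u * s / 2)


def crash_cost_rda(P: float, f_u: float, s: float, p_u: float,
                   p_l: float, S: float, N: float) -> float:
    """c_s': fewer before-images to read, 4 or 5 I/Os per undone page,
    plus S/N as an upper bound for reconstructing the bit map."""
    updaters = P * f_u
    log_reads = updaters * (p_u * s * p_l + 2 * (p_l - p_l ** (s * p_u)) + 2)
    undo = updaters * (p_u * s / 2) * (4 * p_l + 5 * (1 - p_l))
    return log_reads + undo + S / N


if __name__ == "__main__":
    print(crash_cost(6, 0.8, 10, 0.9))
    print(crash_cost_rda(6, 0.8, 10, 0.9, 0.05, 5000, 10))
```

Under these parameter values the $S/N$ term dominates $c_s'$, which is consistent with the text describing it as a conservative upper bound; crashes are rare, so this term has little weight in the overall cost per transaction.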

⁸ In this instance and in other instances in the evaluation, we use an upper bound for the costs involved in RDA
recovery in order to keep things simple. This will lead to a conservative estimate of the benefit of our method.
Figure 9: Results for ¬ATOMIC, STEAL, FORCE, TOC (panels: high update frequency, high retrieval frequency)


We evaluate the algorithms in two different environments depending on the frequency of update
transactions. Figure 9 shows the throughput as a function of the communality $C$ in a system with
high update frequency and in a system with high retrieval frequency. As expected, the improvement
in throughput using RDA recovery is much more significant in the high update frequency environment.
For that environment and for $C = 0.9$ the increase in throughput is about 42%. All the
values for the different parameters of the model, except for $N$, were taken from [12]. These values
are: $B = 300$, $S = 5000$, $N = 10$, $P = 6$, $p_b = 0.01$ and $T = 5 \times 10^6$. For the high update frequency
environment, $s = 10$, $f_u = 0.8$ and $p_u = 0.9$, while for the high retrieval frequency environment,
$s = 40$, $f_u = 0.1$ and $p_u = 0.3$.


5.2.2 Algorithm of the Type ¬ATOMIC, STEAL, ¬FORCE, ACC

In this case, at EOT, before- and after-images of modified pages are written to the log but the
modified pages are not written back to the database. They remain in the buffer until they are
replaced. REDO recovery has to be performed after a system crash and ACC checkpointing is used
to reduce the amount of REDO during crash recovery.

First we need to evaluate $p_m$. To do so, we need to compute the number of transactions that
successively reference a page during its life in the database buffer. If we look at the stream of
references to a page by successive transactions we can see that with probability $C$ the page is
referenced when it is in the buffer and with probability $1 - C$ it is referenced when it is not in the
buffer. Hence the number of references to the page during its life in the buffer follows a geometric
distribution with parameter $C$, which implies that the average number of references to the page
while it is in the buffer is $1/(1 - C)$. Since the probability of a page being modified by a transaction
that references it is $f_u p_u$, the probability of a replaced page being modified during its life in the
buffer is⁹:

$$p_m = 1 - (1 - f_u p_u)^{1/(1 - C)}$$
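The argument above can be checked numerically with a short Python sketch. It assumes the reading $p_m = 1 - (1 - f_u p_u)^{1/(1-C)}$, that is, a page sees on average $1/(1 - C)$ references during its life in the buffer and each one modifies it with probability $f_u p_u$. The function name is illustrative.

```python
def prob_replaced_page_modified(f_u: float, p_u: float, C: float) -> float:
    """p_m: probability that a page is modified at least once during its
    life in the buffer, given communality C (0 <= C < 1)."""
    mean_references = 1.0 / (1.0 - C)
    return 1.0 - (1.0 - f_u * p_u) ** mean_references


if __name__ == "__main__":
    # High update frequency environment, then high retrieval frequency.
    for f_u, p_u in ((0.8, 0.9), (0.1, 0.3)):
        for C in (0.0, 0.5, 0.9):
            print(f_u, p_u, C, prob_replaced_page_modified(f_u, p_u, C))
```

With $C = 0$ each page is referenced once before replacement, so $p_m$ reduces to $f_u p_u$; as $C$ approaches 1 nearly every replaced page is dirty.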

The cost of logging is simply the cost of writing before- and after-images of modified pages and
the BOT/EOT records to the log:

$$c_l = 4(2 s p_u + 2).$$

With RDA recovery, pages that have been stolen from the buffer before EOT do not have to have
their before-images logged. Therefore we need to evaluate the probability $p_s$ for a page being stolen.
The number of references that could cause a given page to be stolen is $(1 - C)s(P - 1)$ and the
probability that any one of those references causes the replacement of the page is $1/(B - Cs)$.
Hence the formula for $p_s$ is:

$$p_s = 1 - \left(1 - \frac{1}{B - Cs}\right)^{(1 - C)s(P - 1)}$$
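As a hedged illustration, the probability of a page being stolen can be computed as below, assuming the reading $p_s = 1 - (1 - 1/(B - Cs))^{(1-C)s(P-1)}$: each of the $(1 - C)s(P - 1)$ buffer misses caused by the other transactions evicts the page with probability $1/(B - Cs)$. The function name is hypothetical.

```python
def prob_page_stolen(B: float, C: float, s: float, P: float) -> float:
    """p_s: probability that a modified page is replaced before EOT."""
    misses_by_others = (1.0 - C) * s * (P - 1)
    p_evict_per_miss = 1.0 / (B - C * s)
    return 1.0 - (1.0 - p_evict_per_miss) ** misses_by_others


if __name__ == "__main__":
    # B = 300 and P = 6, with s = 10 as in the high update environment.
    for C in (0.0, 0.5, 0.9):
        print(C, prob_page_stolen(300, C, 10, 6))
```

A single transaction ($P = 1$) never has its pages stolen by others, and the probability falls as the communality $C$ rises because fewer references miss the buffer.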

In the formula for $p_l$, the value of $K$ is $P s f_u p_u p_s / 2$. The before-image of a modified page will not
be logged with probability $p_s(1 - p_l)$. Hence the cost of logging with RDA recovery is:

$$c_l' = 4(s p_u + s p_u(1 - p_s(1 - p_l)) + 2) + 4(p_l - p_l^{s p_u p_s}).$$

For the cost of backing out a transaction, one difference with the FORCE scheme is that the
log file contains both before- and after-images which will be read until the BOT record of the
aborting transaction is found. Another difference is that, with probability $C$, the modified pages
to be undone are still in the buffer. Hence:

$$c_b = 2 \times (p_u s/2)(P f_u) + P f_u + 4 p_u (s/2)(1 - C) + 4$$

With RDA recovery, the cost of transaction backout becomes:

$$c_b' = 2 \times (p_u s/2)(P f_u) + P f_u + P f_u (p_l - p_l^{s p_u p_s}) + p_u (s/2)((4 + 2p_l)(1 - C)(1 - p_s) + 6 p_s p_l + 5 p_s (1 - p_l)) + 4$$

The cost of performing a checkpoint for ¬RDA and for RDA is given by:

$$c_c = 4(B p_m + 2)$$
$$c_c' = (4 + 2p_l)(B p_m + 2).$$

⁹ The same equation for $p_m$ was derived in [12] using a slightly different argument.

To evaluate the cost of recovery after a crash, we assume that a crash occurs in the middle of a
checkpoint interval. All transactions executed since the last checkpoint have to be redone. Let $r_c$
denote the number of transactions executed during a checkpoint interval. $r_c$ is given by $r_c = I/c_t$,
and the expression for $c_s$ is:

$$c_s = (r_c/2) f_u (c_l/4 + 4 s p_u) + P f_u (c_l/4 + 4(s/2) p_u - 1)$$

The $-1$ term corresponds to the EOT record which is accounted for in $c_l/4$ but is not read. The
cost of recovery from a crash with the RDA recovery technique is:

$$c_s' = (r_c'/2) f_u (c_l'/4 + 4 s p_u) + P f_u (c_l'/4 + (s/2) p_u (4(1 - p_s) + 4 p_s p_l + 5 p_s (1 - p_l)) - 1) + S/N.$$

The value of the optimal checkpointing interval $I$ is obtained by plugging the expression for $c_s$ in
Equation 1. This yields:

$$I = \left( \frac{2 c_t c_c \left(T - P f_u (c_l + 4(s/2) p_u) - P f_u\right)}{f_u (c_l + 4 s p_u)} \right)^{1/2}$$

The formula for $I$ in the case of RDA recovery is derived in a similar fashion. The value of $a$ in the
expressions of $c_r$ and $c_u$ is 4 for ¬RDA and $4 + 2p_l$ for RDA because with the ¬FORCE discipline,
when replacement takes place the old version of the data is not available any more in the buffer.

Figure 10: Results for ¬ATOMIC, STEAL, ¬FORCE, ACC

Figure 10 shows the results for both environments. It can be seen that the improvement is
not significant in this case. However, the interesting result is that while without RDA recovery
the ¬FORCE, ACC type algorithm outperforms the FORCE, TOC scheme, when RDA recovery
is used the situation is reversed and the latter algorithm outperforms the former by a significant
margin.


5.3  Record  Logging 

In this section we look at recovery algorithms in which only modified records are logged. The
unit of transfer between main memory and secondary storage is still a page; however, when logging
is performed, logged records are encapsulated into pages and then written to the log file. Some
additional parameters of the system need to be introduced for the analysis of record logging: $d$
denotes the number of update statements per transaction; $r$ denotes the average length (in bytes)
of a long log entry such as a data record; $e$ denotes the average length of a short log entry such
as a table entry; $l_{bc}$ denotes the length of the BOT and EOT records; $l_p$ denotes the length of a
physical page; $l_h$ denotes the length of a log chain header. The values for the first five parameters
are taken from [12]. These values are: $d$ = [illegible] for high update frequency environments and $d = 8$
for low update frequency environments, $r = 100$, $e = 10$, $l_{bc} = 16$ and $l_p = 2020$. The value for $l_h$
was set to 4. Assuming that each update statement causes one long log entry and that $s > d$, the
average length of a log entry can be derived [12]:

$$L = (d r + (s - d) e)/s.$$
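As a small worked example, the average log entry length $L = (dr + (s - d)e)/s$ can be evaluated as follows. The function name is illustrative; the values in the usage section are example inputs.

```python
def average_log_entry_length(d: float, r: float, e: float, s: float) -> float:
    """L: d long entries of r bytes and (s - d) short entries of e bytes,
    averaged over the s pages referenced by a transaction (requires s > d)."""
    if s <= d:
        raise ValueError("the model assumes s > d")
    return (d * r + (s - d) * e) / s


if __name__ == "__main__":
    # Example inputs: s = 40 pages, d = 8 update statements,
    # r = 100 bytes per long entry, e = 10 bytes per short entry.
    print(average_log_entry_length(d=8, r=100, e=10, s=40))  # 28.0
```

The result is far smaller than the physical page length $l_p$, which is why record logging makes the log much cheaper to write than page logging.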

5.3.1 Algorithm of the Type ¬ATOMIC, STEAL, FORCE, TOC

With record logging, the locking granule can be less than a page. We assume that record locking
is used in order to enhance concurrency. This implies that the total number of pages modified by
a given set of $P$ concurrent transactions is not the same as for the above algorithms for which page
locking was assumed. We will denote this number by $s_u$. An expression for $s_u$ is derived in the
Appendix. The value of $K$ in the expression of $p_l$ is $s_u/2$. We assume that group commit is used
so that log records from different transactions can be grouped in the same page and written to the
log. The derivations of the cost equations are similar to those in Section 5.2.1. We simply list the
equations without detailed explanation.

$$c_l = 3 s p_u + 4 \times 2(2 l_{bc} + s p_u (l_{bc} + L))/l_p$$

$$c_l' = (3 + 2p_l) s p_u + 4(2 l_{bc} + s p_u (l_{bc} + L))/l_p + 4(2 l_{bc} + s p_u (l_{bc} + L) p_l + (l_{bc} + l_h)(p_l - p_l^{s p_u}))/l_p$$

$$c_b = P f_u (l_{bc} + s p_u (l_{bc} + L)/2)/l_p + 4(p_u s/2) + 4$$

$$c_b' = P f_u (l_{bc} + s p_u (l_{bc} + L) p_l/2 + (l_{bc} + l_h)(p_l - p_l^{s p_u}))/l_p + (p_u s/2)(6p_l + 5(1 - p_l)) + 4$$

$$c_s = P f_u (2 l_{bc} + s p_u (l_{bc} + L))/l_p + 4 P f_u (p_u s/2)$$

$$c_s' = P f_u (2 l_{bc} + s p_u (l_{bc} + L) p_l + 2(l_{bc} + l_h)(p_l - p_l^{s p_u}))/l_p + (P f_u p_u s/2)(4p_l + 5(1 - p_l))$$
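The following Python sketch evaluates the record logging cost of logging for the FORCE, TOC case with and without RDA recovery, under one reading of the equations: $c_l = 3 s p_u + 8(2 l_{bc} + s p_u(l_{bc} + L))/l_p$, with the RDA variant splitting the log cost into a REDO part that is always written and an UNDO part written with probability $p_l$. Function names are hypothetical, and the value of $L$ used in the example is an arbitrary placeholder.

```python
def record_logging_cost(s, p_u, l_bc, L, l_p):
    """c_l: 3 I/Os per page written back, plus UNDO and REDO log records
    packed into log pages of l_p bytes (4 I/Os per log page)."""
    n = s * p_u
    log_bytes = 2 * l_bc + n * (l_bc + L)
    return 3 * n + 4 * 2 * log_bytes / l_p


def record_logging_cost_rda(s, p_u, p_l, l_bc, L, l_h, l_p):
    """c_l': REDO records always written; UNDO records and the log chain
    header are written only when the parity group is dirty."""
    n = s * p_u
    redo_bytes = 2 * l_bc + n * (l_bc + L)
    undo_bytes = (2 * l_bc + n * (l_bc + L) * p_l
                  + (l_bc + l_h) * (p_l - p_l ** n))
    return (3 + 2 * p_l) * n + 4 * redo_bytes / l_p + 4 * undo_bytes / l_p


if __name__ == "__main__":
    args = dict(s=10, p_u=0.9, l_bc=16, L=37.0, l_p=2020)  # L is a placeholder
    print(record_logging_cost(**args))
    print(record_logging_cost_rda(p_l=0.05, l_h=4, **args))
```

Because a whole transaction's log records fit in a fraction of a log page, the saving in UNDO logging is small compared with the page write-back cost, which matches the modest benefit reported for this configuration.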

Figure 11 shows the throughput for the FORCE, TOC type of algorithms with and without RDA
recovery as a function of the communality in the buffer for the case of record logging.


5.3.2 Algorithm of the Type ¬ATOMIC, STEAL, ¬FORCE, ACC

The cost equations for this case can be derived using the results of Sections 5.2.2 and 5.3.1. The
value of $K$ in the expression for $p_l$ is $s_u p_s/2$.

Figure 11: Results for ¬ATOMIC, STEAL, FORCE, TOC, in the case of record logging.


$$c_l = 4(2 l_{bc} + s p_u (l_{bc} + 2L))/l_p$$

$$c_l' = 4(2 l_{bc} + s p_u (l_{bc} + L(2 - p_s(1 - p_l))) + (l_{bc} + l_h)(p_l - p_l^{s p_u p_s}))/l_p$$

$$c_b = P f_u (c_l/8) + 4 p_u (s/2)(1 - C) + 4$$

$$c_b' = P f_u (c_l'/8) + p_u (s/2)((4 + 2p_l)(1 - C)(1 - p_s) + 6 p_s p_l + 5 p_s (1 - p_l)) + 4$$

$$c_s = (r_c/2) f_u (c_l/4 + 4 s p_u) + P f_u (c_l/4 + 4 p_u (s/2))$$

$$c_s' = (r_c'/2) f_u (c_l'/4 + 4 s p_u) + P f_u (c_l'/4 + p_u (s/2)(5 p_s (1 - p_l) + 4(1 - p_s(1 - p_l))))$$

The equations for $c_c$ and $c_c'$ are the same as in Section 5.2.2. The equations for $c_r$ and $c_u$ need to
be modified to account for the extra cost involved in logging modified records in pages stolen from
the buffer before EOT. The modified record of a stolen page needs to be written to the log before
the page can be replaced. Let $p_r$ denote the proportion of replaced pages modified by uncommitted
transactions. We have $p_r = s_u'/(B - Cs)$, where $s_u'$ is the number of pages in the buffer modified
by the concurrently executing transactions as seen by an incoming transaction. $s_u'$ is obtained by
replacing $P$ with $P - 1$ in the expression for $s_u$. This gives the following equations for $c_r$ and $c_r'$;
the equations for $c_u$ and $c_u'$ are obtained in a similar fashion:

$$c_r = s(1 - C) + 4s(1 - C)(p_m + 2p_r)$$

$$c_r' = s(1 - C) + 4s(1 - C)(p_m + 2 p_r p_l)$$
Figure 12 shows the throughput for the ¬FORCE, ACC type of algorithms with and without
RDA recovery as a function of the communality in the buffer for both evaluation environments.
Unlike the page logging case, the ¬FORCE, ACC scheme performs much better than the FORCE, TOC
scheme for the range of values of $C$ encountered in typical applications [2]. Also, for the ¬FORCE,
ACC algorithm, the increase in throughput achieved by using RDA recovery is higher than for the
same algorithm with page logging. This is the case because, with record logging, the cost of logging
the updates of a stolen page is high relative to the cost of logging non-stolen pages and RDA
recovery reduces that cost by eliminating the need for logging in most cases. For example, for the
high update frequency environment and for $C = 0.9$, the increase in throughput is about 14%.
The benefit of RDA recovery increases with the amount of work performed by each transaction.
Figure 13 shows the percent increase in throughput achieved by RDA recovery as a function of the
number of pages accessed by each transaction ($s$) for the high update frequency environment with
$C = 0.9$.

Figure 13: Benefit of RDA recovery as a function of the number of pages referenced by a transaction.

6 Experimental Evaluation of FORCE, TOC Algorithms

We have conducted some experiments to corroborate the findings of our analytical model. We have
used data from an operational OLTP system, namely a Tandem NonStop system. The data were
extracted from log files generated by the Transaction Monitoring Facility (TMF) [15] during normal
system operation. The log entries contained transaction status information, before and after images
of modified data, names of accessed files and disks as well as timing information. Using the log
entries, we constructed a trace of update accesses performed by each transaction before it commits
or aborts.

Using this data, we simulated the behavior of the database buffer, the recovery algorithm and
the I/O subsystem. As in the analytical model, we assumed that the system was I/O bound and
hence we ignored CPU processing times and accounted only for the cost of performing I/O. However,
in the simulations we did not simply count the number of I/O operations performed, but rather we
simulated the execution of the I/O requests in the disk array. We have simulated a parity striping
organization. Since the data did not contain any multiblock references, we expect the performance
of a data striping organization to be similar to that of parity striping. The disk parameters used
in the simulations are shown in Table 1. The database accessed in our trace resided on five disks.
Table 1: Disk parameters.

Parameter             Value
Max. latency          16.0 ms
Max. seek             17 ms
Tracks per platter    1260
Sectors per track     52
Bytes per sector      512
Number of platters    15

We assumed that the log file is stored on a separate disk array.

The data available to us did not include any read access information. Thus it was not possible
to use it to evaluate ¬FORCE, ACC algorithms because, for these algorithms, the improvement
afforded by RDA recovery is dependent on the frequency of replacement of modified pages and
hence requires a detailed simulation of the buffer behavior. Without a trace of read accesses we
could not obtain reliable simulation results for those algorithms. However, for FORCE, TOC
algorithms, page replacement does not affect the cost of recovery operations as much as in the
¬FORCE, ACC case. Therefore we were able to use the available data to obtain simulation results
for such algorithms. Since our analytical model did not show much promise for RDA recovery in
the case of FORCE, TOC algorithms with record logging, we concentrated instead on page logging.
We did not simulate recovery from system crash since none occurred in the interval during which
the data was collected. Hence the throughput $r_t$ of the system was taken to be the reciprocal of the
average cost per transaction $c_t$. $c_t$ was measured as the total cost (disk usage time) of executing
the I/O requests in the disk array system divided by the number of transactions. Figure 14 shows
the measured throughput¹⁰ of FORCE, TOC algorithms with and without RDA recovery in the
case of page logging as a function of the number ($B$) of frames in the buffer.

An LRU buffer replacement policy was assumed. The hit ratio ranged from 77% for $B = 50$
to 97% for $B = 200$. The improvement in throughput decreases from 39% for $B = 50$ to 28% for
$B = 100$ and then increases slowly as $B$ increases to reach about 20% for $B = 200$. The reason for
the decrease in the first part of the curve is that for small values of $B$, some pages in the buffer are
replaced more than once, which increases the amount of logging in the ¬RDA case. For $B > 100$,
pages in the buffer are replaced at most once, hence the amount of logging remains about constant
as $B$ increases while the amount of I/O to the database continues to decrease. Hence, as $B$ goes
from 100 to 200, the savings due to RDA recovery become more significant relative to the overall
I/O cost.

Figure 14: Empirical results for FORCE, TOC algorithms with page logging.

¹⁰ For clarity, we multiplied the throughput value by a constant factor.
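For readers who want to reproduce the kind of measurement described here, the following is a minimal Python sketch of an LRU buffer with $B$ frames that reports the hit ratio over a page reference trace. It is not the simulator used in the experiments; the trace format (a sequence of page identifiers) and the function name are assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Hashable, Iterable


def lru_hit_ratio(trace: Iterable[Hashable], frames: int) -> float:
    """Fraction of references found in an LRU buffer of `frames` pages."""
    buffer: "OrderedDict[Hashable, bool]" = OrderedDict()
    hits = 0
    total = 0
    for page in trace:
        total += 1
        if page in buffer:
            hits += 1
            buffer.move_to_end(page)        # mark as most recently used
        else:
            if len(buffer) >= frames:
                buffer.popitem(last=False)  # evict least recently used page
            buffer[page] = True
    return hits / total if total else 0.0


if __name__ == "__main__":
    trace = [1, 2, 1, 3, 2, 1]  # hypothetical page reference trace
    for frames in (2, 3):
        print(frames, lru_hit_ratio(trace, frames))
```

A larger buffer raises the hit ratio and lowers the number of replacements, which is the effect the discussion of $B$ relies on.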

7  Conclusions 

In this paper, we have presented a scheme that uses redundant disk arrays to achieve rapid recovery
from media failures in database systems and simultaneously provide support for recovery from
transaction aborts and system crashes. The redundancy present in the array is exploited to allow
a large fraction of pages modified by active transactions to be written to disk and updated in place
without the need for undo logging, thus reducing the number of recovery actions performed by the
recovery component. The method uses a twin page scheme to store the parity information so that
it can be efficiently used in transaction undo recovery. The extra storage used is about $(100/N)\%$
of the size of the database, $N$ being the number of disks in the array.
We used a detailed analytical model to evaluate the benefit of our scheme in a system equipped
with redundant disk arrays. We found that, in the case of page logging, a FORCE, TOC algorithm
combined with RDA recovery significantly outperforms a FORCE, TOC algorithm without RDA
recovery as well as ¬FORCE, ACC type of algorithms. In the case of record logging, we found that
a ¬FORCE, ACC algorithm performs best and that the addition of RDA recovery to it significantly
improves its performance, especially for transactions with a large number of updated pages. We
also performed simulations using data from an operational OLTP system to validate some of the
results of the analytical model.

Appendix 

Derivation of the Formula for $s_u$

$s_u$ is the number of pages in the buffer updated by a set of $P$ concurrent transactions. Let $s^{(k)}$
denote the number of pages in the buffer updated by $k$ update transactions. Since there are $Pf_u$
update transactions executing concurrently in the system, we have $s_u = s^{(Pf_u)}$. If we number the
$Pf_u$ update transactions from 1 to $Pf_u$ in the order of their entry in the system, then when the
$k$th update transaction enters the system, it will find $C s p_u$ of the $s p_u$ pages it needs to modify
already in the buffer. We make the assumption that out of those pages, $C s p_u s^{(k-1)}/B$ belong
to the $k - 1$ update transactions already executing in the system¹¹. Hence, we have the following
recurrence equation:

$$s^{(k)} = s^{(k-1)} + s p_u - C s p_u s^{(k-1)}/B.$$

Using $s^{(1)} = s p_u$, we obtain $s_u = s^{(Pf_u)} = (B/C)(1 - (1 - C s p_u/B)^{Pf_u})$.
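As a sanity check on the derivation, the Python sketch below iterates a recurrence of the form $s^{(k)} = s^{(k-1)} + s p_u - C s p_u s^{(k-1)}/B$ starting from $s^{(1)} = s p_u$ and compares it with the closed form $(B/C)(1 - (1 - C s p_u/B)^k)$. The recurrence is written under the stated assumption about shared pages; the function names are illustrative.

```python
def updated_pages_recurrence(k: int, s: float, p_u: float,
                             C: float, B: float) -> float:
    """Pages updated by k update transactions, computed step by step."""
    pages = 0.0
    for _ in range(k):
        # Each new transaction modifies s*p_u pages, of which a fraction
        # C * pages / B is assumed to be already counted.
        pages = pages + s * p_u - C * s * p_u * pages / B
    return pages


def updated_pages_closed_form(k: float, s: float, p_u: float,
                              C: float, B: float) -> float:
    """Closed-form solution of the recurrence (requires C > 0)."""
    return (B / C) * (1.0 - (1.0 - C * s * p_u / B) ** k)


if __name__ == "__main__":
    s, p_u, C, B = 10, 0.9, 0.9, 300
    for k in range(1, 7):
        print(k, updated_pages_recurrence(k, s, p_u, C, B),
              updated_pages_closed_form(k, s, p_u, C, B))
```

The closed form also accepts a non-integer number of update transactions such as $Pf_u = 4.8$, which the step-by-step recurrence cannot represent directly.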

¹¹ Update transactions can share pages because record logging is used instead of page logging.




Acknowledgements 


The authors would like to thank Robert Horst, Jim Carley, Srikanth Shoroff, Robert van der Linden,
and Susan Whitford of Tandem Corporation for providing the TMF audit tapes and for answering
numerous questions.

References 

[1] Bitton, D., and Gray, J. Disk shadowing. In Proceedings of the 14th International Conference
on Very Large Data Bases (Sept. 1988), pp. 331-338.

[2] Effelsberg, W., and Haerder, T. Principles of database buffer management. ACM Transactions
on Database Systems 9, 4 (Dec. 1984), 560-595.

[3] Gray, J., Horst, B., and Walker, M. Parity striping of disk arrays: Low-cost reliable storage
with acceptable throughput. In Proceedings of the 16th International Conference on Very
Large Data Bases (Aug. 1990), pp. 148-161.

[4] Gray, J., McJones, P., Blasgen, M., Lindsay, B., Lorie, R., Price, T., Putzolu, F., and Traiger,
I. The recovery manager of the System R database manager. ACM Computing Surveys 13, 2
(1981), 223-242.

[5] Haerder, T., and Reuter, A. Principles of transaction-oriented database recovery. ACM
Computing Surveys 15, 4 (Dec. 1983), 287-317.

[6] Kent, J., and Garcia-Molina, H. Optimizing shadow recovery algorithms. IEEE Trans. Softw.
Eng. 14, 2 (Feb. 1988), 155-168.

[7] Kim, M. Y. Synchronized disk interleaving. IEEE Trans. Comput. C-35, 11 (Nov. 1986),
978-988.

[8] Livny, M., Khoshafian, S., and Boral, H. Multi-disk management algorithms. In Proceedings
of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (May
1987), pp. 69-77.

[9] Lorie, R. A. Physical integrity in a large segmented database. ACM Trans. Database Syst. 2,
1 (Mar. 1977), 91-104.

[10] Patterson, D., Gibson, G., and Katz, R. A case for redundant arrays of inexpensive disks
(RAID). In Proceedings of the ACM SIGMOD Conference (June 1988), pp. 109-116.

[11] Reuter, A. A fast transaction-oriented logging scheme for UNDO recovery. IEEE Trans. Softw.
Eng. SE-6, 4 (July 1980), 348-356.

[12] Reuter, A. Performance analysis of recovery techniques. ACM Transactions on Database
Systems 9, 4 (Dec. 1984), 526-559.

[13] Salem, K., and Garcia-Molina, H. Disk striping. In Proceedings of the IEEE International
Conference on Data Engineering (Feb. 1986), pp. 336-342.

[14] Stonebraker, M., Katz, R., Patterson, D., and Ousterhout, J. The design of XPRS. In
Proceedings of the 14th International Conference on Very Large Data Bases (Sept. 1988),
pp. 318-330.

[15] Tandem Computers Inc. Introduction to the Transaction Monitoring Facility (TMF).
Cupertino, CA. Part Number 15996.

[16] Wu, K.-L., and Fuchs, W. K. Rapid transaction-undo recovery using twin-page storage
management. In Proceedings of IEEE Compsac (Nov. 1990), pp. 295-300.


